A picture is worth a thousand words. Only recently, however, have we seen success stories in the understanding of visual scenes: models that can detect and name objects, describe their attributes, and recognize their relationships and interactions. In this paper, we propose a phrase-based hierarchical Long Short-Term Memory (phi-LSTM) model to generate image descriptions. The proposed model encodes a sentence as a sequence of combined phrases and words, rather than as a sequence of words alone as in conventional solutions. The two levels of the model are dedicated to (i) learning to generate image-relevant noun phrases, and (ii) producing an appropriate image description from those phrases and the other words in the corpus. Adopting a convolutional neural network to learn image features and the LSTM to learn the word sequence in a sentence, the proposed model achieves better or competitive results compared with state-of-the-art models on the Flickr8k and Flickr30k datasets.
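The core idea of encoding a sentence as a mixed sequence of phrase vectors and word vectors can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: the embedding dimension, random word embeddings, and the example sentence are all hypothetical, and a single-layer LSTM cell in plain NumPy stands in for the trained phrase- and sentence-level encoders.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # embedding/hidden size (hypothetical)

def lstm_params(d):
    # one stacked weight matrix for the four gates; input is [x; h]
    return {"W": rng.standard_normal((4 * d, 2 * d)) * 0.1,
            "b": np.zeros(4 * d)}

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(p, x, h, c):
    z = p["W"] @ np.concatenate([x, h]) + p["b"]
    i, f, o, g = np.split(z, 4)          # input, forget, output gates + candidate
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)
    return h, c

def encode(p, xs):
    # run the LSTM over a sequence of vectors, return the final hidden state
    h = c = np.zeros(D)
    for x in xs:
        h, c = lstm_step(p, x, h, c)
    return h

# toy vocabulary of random word embeddings (hypothetical)
emb = {w: rng.standard_normal(D)
       for w in "a black dog runs on the grass".split()}

phrase_enc = lstm_params(D)  # phrase-level LSTM
sent_enc = lstm_params(D)    # sentence-level LSTM

# noun phrases are first composed into single vectors...
np1 = encode(phrase_enc, [emb[w] for w in ("a", "black", "dog")])
np2 = encode(phrase_enc, [emb[w] for w in ("the", "grass")])

# ...then the sentence is encoded as a sequence of phrase
# vectors interleaved with the remaining word vectors
sentence_seq = [np1, emb["runs"], emb["on"], np2]
s = encode(sent_enc, sentence_seq)
print(s.shape)  # a single sentence vector of dimension D
```

The point of the hierarchy is that the sentence-level encoder sees only four items here (two phrase vectors and two words) instead of seven raw words, so noun phrases are treated as atomic units.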